Draft: Use Megatron-Bridge recipes for megatron_cfg. #1894
base: main
Conversation
📝 Walkthrough
This PR introduces Megatron-Bridge recipe support by replacing hardcoded megatron_cfg settings with recipe-based configuration loaded from a fully-qualified megatron_recipe path.
Changes
Sequence Diagram
sequenceDiagram
participant Config as Config Loader
participant Recipe as Recipe Module
participant Setup as Megatron Setup
participant Validation as Validation
participant ConfigContainer as Config Container
Config->>Setup: Provide policy config with megatron_recipe
Setup->>Recipe: Call load_recipe(recipe_path)
Recipe->>Recipe: Parse fully-qualified path
Recipe->>Recipe: Dynamic import module
Recipe->>Recipe: Resolve function name
Recipe->>ConfigContainer: Invoke recipe function
ConfigContainer-->>Recipe: Return ConfigContainer
Recipe-->>Setup: Return loaded megatron_cfg
Setup->>Setup: Merge checkpoint/policy configs into megatron_cfg
Setup->>Setup: Update megatron_cfg sub-configs (optimizer, scheduler, ddp, tokenizer)
Setup->>Validation: Validate optimizer_config(megatron_cfg)
Setup->>Validation: Validate training_config(megatron_cfg)
Validation-->>Setup: Validation complete
Setup-->>Config: Return finalized megatron_cfg
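As a rough illustration of the flow above, a recipe loader could look like the sketch below (hypothetical code; the actual load_recipe in nemo_rl/models/megatron/recipe_config.py may differ in naming and error handling):

import importlib
from typing import Any


def load_recipe(recipe_path: str) -> Any:
    """Resolve a fully-qualified recipe path and call it to obtain a ConfigContainer.

    Example: "megatron.bridge.recipes.llama.llama3.llama31_70b_pretrain_config"
    """
    module_name, _, func_name = recipe_path.rpartition(".")
    if not module_name or not func_name:
        raise ValueError(f"Expected '<module>.<function>', got: {recipe_path!r}")
    module = importlib.import_module(module_name)  # dynamic import of the recipe module
    recipe_fn = getattr(module, func_name)  # resolve the recipe function by name
    return recipe_fn()  # the recipe function returns a populated ConfigContainer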
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3.yaml (2)
42-47: ⚠️ Potential issue | 🟡 Minor
Log paths reference .v2 but the recipe filename is .v3.
The log_dir and wandb.name still contain .v2, which will cause confusion when distinguishing runs from the actual v2 config.
Proposed fix
 logger:
-  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2
+  log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3
   wandb_enabled: true
   tensorboard_enabled: true
   wandb:
     project: nemo-rl
-    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2
+    name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v3
19-32: ⚠️ Potential issue | 🟡 Minor
Update version references from v2 to v3.
The filename was bumped to .v3 but the log_dir (line 42) and wandb.name (line 47) still reference .v2:
log_dir: logs/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2
wandb.name: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2
Change both to .v3 for consistency with the filename.
examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml (1)
21-24: ⚠️ Potential issue | 🔴 Critical
Use the correct base config for megatron recipes.
The interpolations on lines 21–22 reference ${.megatron_cfg.tensor_model_parallel_size} and ${.megatron_cfg.context_parallel_size}, but the recipe inherits from ../../distillation_math.yaml, which only defines dtensor_cfg and lacks the megatron_cfg block needed for these interpolations to resolve. Change the defaults to ../../distillation_math_megatron.yaml, which provides the required megatron configuration with proper tensor and context parallel sizes.
🤖 Fix all issues with AI agents
In `@examples/configs/grpo_math_70B_megatron.yaml`:
- Around line 25-26: The config mixes Llama 3.3 model with a Llama 3.1 recipe:
update either the model string or the recipe to match versions—specifically
change the model from "meta-llama/Llama-3.3-70B-Instruct" to
"meta-llama/Llama-3.1-70B-Instruct" to match the existing megatron_recipe value
"llama31_70b_pretrain_config", or replace "megatron_recipe:
megatron.bridge.recipes.llama.llama3.llama31_70b_pretrain_config" with a Llama
3.3–compatible recipe if you have one; ensure consistency across similar configs
like grpo_math_8B_megatron.yaml and sft_openmathinstruct2_megatron.yaml.
In
`@examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml`:
- Line 25: The policy's megatron_recipe is pointing to the 32B config
(megatron.bridge.recipes.qwen.qwen3.qwen3_32b_pretrain_config) while the policy
model name is Qwen/Qwen3-1.7B-Base; change megatron_recipe to the matching 1.7B
recipe (if one exists) or remove the megatron_recipe entry so the policy uses
the HuggingFace model directly (like the FSDP variant) to ensure the student
model architecture matches Qwen/Qwen3-1.7B-Base.
In
`@examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml`:
- Line 11: The recipe filename advertises TP=2 but the config lacks that
override; update the policy section to explicitly set
tensor_model_parallel_size: 2 (ensuring it overrides the base default
tensor_parallel_size: 1) so the effective parallelism matches the filename, or
alternatively rename the file to remove the "megatrontp2" segment; locate the
megatron_recipe entry and the policy block in the YAML (look for keys named
policy, tensor_model_parallel_size, and tensor_parallel_size) and add the
explicit tensor_model_parallel_size: 2 entry or adjust the filename accordingly.
In `@nemo_rl/models/megatron/recipe_config.py`:
- Line 1: Update the copyright header year from 2025 to 2026 in the file (the
top-line comment in recipe_config.py); locate the existing copyright comment and
change the year token "2025" to "2026" so the file header reflects the current
year.
In `@nemo_rl/models/megatron/setup.py`:
- Around line 298-303: The FileNotFoundError message for pretrained_run_config
is truncated; update the exception text (in the block that checks
os.path.exists(pretrained_run_config) where pretrained_run_config and rank are
referenced) to complete the sentence — e.g., explain that the one-time HF->mcore
conversion on rank=0 may have written to a directory not mounted on this node
and instruct the user to verify that the directory is mounted and that rank 0
wrote the config (include the path and rank variables in the message for
clarity).
- Around line 493-496: The validators access megatron_cfg.optimizer without
checking for None; add a guard at the start of _validate_optimizer_config (and
similarly in _validate_dtype_config) to return early if megatron_cfg.optimizer
is None (or otherwise bail with a no-op) so you don't dereference None when
reading optimizer_cpu_offload, optimizer_offload_fraction,
use_precision_aware_optimizer, bf16 or fp16; reference the
megatron_cfg.optimizer object in those functions and ensure any downstream logic
expects a present optimizer only after this check.
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py`:
- Around line 281-293: Remove the debug block that prints "HELLO", writes a
hardcoded YAML path, and calls sys.exit during initialization: delete the
unconditional print("HELLO") and the entire rank-0 conditional that calls
self.megatron_cfg.to_yaml(...), imports sys, prints debug/warning messages, and
invokes sys.exit(0). Ensure no hardcoded paths remain and do not perform any
early process termination in __init__; if you need to persist the config,
replace this with a safe, optional debug hook (e.g., gated by a proper config
flag) that runs non-destructively and is synchronized across ranks rather than
exiting the process.
🧹 Nitpick comments (4)
examples/configs/recipes/llm/grpo-dapomath17k-dsv3-megatron.yaml (1)
1-59: Filename doesn't follow the LLM recipe naming convention.
The current filename grpo-dapomath17k-dsv3-megatron.yaml is missing the <nodes>n<gpus>g component required by the naming pattern. Based on the cluster config (num_nodes: 32, gpus_per_node: 8), a compliant name would look something like: grpo-dsv3-32n8g-megatron-dapomath17k.yaml
This is pre-existing and not introduced by this PR, so feel free to defer. As per coding guidelines, "For LLM recipes, follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh)".
examples/configs/grpo_math_qwen30ba3b_megatron.yaml (1)
29-30: Nit: trailing blank line after megatron_recipe.
Line 30 appears to be a changed line that's just whitespace. Consider removing the trailing blank line to keep the config tidy.
Proposed fix
  megatron_recipe: megatron.bridge.recipes.qwen.qwen3.qwen3_30b_pretrain_config
-
  megatron_cfg:
nemo_rl/models/megatron/setup.py (2)
592-604: _update_dataclass_fields silently ignores unknown keys — could mask config typos.
The hasattr check on line 603 means a misspelled key (e.g., learning_rate instead of lr) is silently dropped. Consider logging a warning for unrecognized keys to aid debugging.
♻️ Proposed improvement
 def _update_dataclass_fields(target: Any, updates: dict) -> None:
     for key, value in updates.items():
         if hasattr(target, key):
             setattr(target, key, value)
+        else:
+            warnings.warn(
+                f"Ignoring unknown config key '{key}' for {type(target).__name__}. "
+                "Check for typos in megatron_cfg overrides."
+            )
632-637: Hidden default value 1 for train_global_batch_size.
Line 635 uses config.get("train_global_batch_size", 1), which introduces a hidden default. Per coding guidelines, YAML should be the single source of truth for defaults, and required config values should be accessed directly. Even though the comment says this is "ignored by NeMo-RL", Megatron itself may use it for internal validation. Consider using config["train_global_batch_size"] to surface misconfiguration early. As per coding guidelines: "Access required config values directly (e.g., policy_cfg['precision']) and assume they are present; do not introduce hidden defaults in code."
📝 Proposed fix
- megatron_cfg.train.global_batch_size = config.get("train_global_batch_size", 1)  # ignored by NeMo-RL
+ megatron_cfg.train.global_batch_size = config["train_global_batch_size"]  # ignored by NeMo-RL
megatron_recipe: megatron.bridge.recipes.llama.llama3.llama31_70b_pretrain_config
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for any llama3.3 or llama33 references in the codebase
echo "=== Searching for llama3.3 or llama33 references ==="
rg -i 'llama3[._]?3' --type yaml --type py -g '!*.pyc' 2>/dev/null | head -50
echo ""
echo "=== Searching for all llama pretrain_config references ==="
rg 'llama.*pretrain_config|llama.*config' --type py --type yaml 2>/dev/null | head -50
echo ""
echo "=== Searching for megatron.bridge.recipes.llama directory/references ==="
find . -path '*/megatron*' -name '*llama*' 2>/dev/null | grep -E '\.(py|yaml)$' | head -30
Repository: NVIDIA-NeMo/RL
Length of output: 3275
🏁 Script executed:
#!/bin/bash
# Read the full file to see if there are comments explaining the recipe choice
echo "=== Full content of grpo_math_70B_megatron.yaml ==="
cat examples/configs/grpo_math_70B_megatron.yaml
echo ""
echo "=== Search for TODOs or FIXMEs related to llama or recipe ==="
rg -n 'TODO|FIXME' examples/configs/grpo_math_70B_megatron.yaml
echo ""
echo "=== Check megatron.bridge documentation or references ==="
rg -n 'megatron.*bridge.*llama' --type py --type md 2>/dev/null | head -20
echo ""
echo "=== Search for Llama 3.3 in model name references ==="
rg -n 'Llama-3.3|Llama 3.3' --type yaml 2>/dev/null
Repository: NVIDIA-NeMo/RL
Length of output: 2726
Update model to Llama 3.1 or verify recipe compatibility with Llama 3.3.
Line 9 specifies meta-llama/Llama-3.3-70B-Instruct, but line 25 references llama31_70b_pretrain_config (Llama 3.1). No llama33_70b_pretrain_config exists in the codebase. Either change the model to meta-llama/Llama-3.1-70B-Instruct to match the recipe, or provide an alternative Llama 3.3–compatible recipe reference. All other similar configurations (e.g., grpo_math_8B_megatron.yaml, sft_openmathinstruct2_megatron.yaml) use matching model and recipe versions.
🤖 Prompt for AI Agents
In `@examples/configs/grpo_math_70B_megatron.yaml` around lines 25 - 26, The
config mixes Llama 3.3 model with a Llama 3.1 recipe: update either the model
string or the recipe to match versions—specifically change the model from
"meta-llama/Llama-3.3-70B-Instruct" to "meta-llama/Llama-3.1-70B-Instruct" to
match the existing megatron_recipe value "llama31_70b_pretrain_config", or
replace "megatron_recipe:
megatron.bridge.recipes.llama.llama3.llama31_70b_pretrain_config" with a Llama
3.3–compatible recipe if you have one; ensure consistency across similar configs
like grpo_math_8B_megatron.yaml and sft_openmathinstruct2_megatron.yaml.
${.megatron_cfg.context_parallel_size}}, 2}
megatron_cfg:
  enabled: true
  megatron_recipe: megatron.bridge.recipes.qwen.qwen3.qwen3_32b_pretrain_config
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find all Qwen3 related recipe files and configs
echo "=== Searching for Qwen3 recipe definitions ==="
find . -type f \( -name "*.py" -o -name "*.yaml" \) | xargs grep -l "qwen3.*recipe\|qwen3.*config" | head -20
echo -e "\n=== Looking for available Qwen3 megatron recipes ==="
rg "qwen3.*pretrain_config|def.*qwen3.*config" --type py -A 2 | head -50
echo -e "\n=== Checking for 1.7b Qwen3 recipe variants ==="
rg "qwen3.*1[._]7" --type py --type yaml
echo -e "\n=== Checking the specific recipe module path ==="
find . -path "*megatron/bridge/recipes/qwen/qwen3*" -type f | head -10
Repository: NVIDIA-NeMo/RL
Length of output: 3262
🏁 Script executed:
#!/bin/bash
# Check the file in question and nearby distillation configs
echo "=== Current file content ==="
head -40 examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml
echo -e "\n=== Check other distillation YAML files for megatron_recipe usage ==="
find examples/configs/recipes/llm -name "*distillation*.yaml" -type f | xargs grep -l "megatron_recipe"
echo -e "\n=== Show megatron_recipe usage in distillation configs ==="
find examples/configs/recipes/llm -name "*distillation*.yaml" -type f -exec grep -H -B 2 -A 2 "megatron_recipe" {} \;
Repository: NVIDIA-NeMo/RL
Length of output: 2029
🏁 Script executed:
#!/bin/bash
# Search for how megatron_recipe is used in the codebase (in Python code)
echo "=== How megatron_recipe is referenced in code ==="
rg "megatron_recipe" --type py -B 3 -A 3 | head -100
echo -e "\n=== Check if there's a mapping or factory for Qwen3 recipes ==="
rg "qwen3_.*pretrain_config|qwen3_.*model_config" --type py
Repository: NVIDIA-NeMo/RL
Length of output: 2617
🏁 Script executed:
#!/bin/bash
# Search for megatron.bridge recipe implementations
echo "=== Searching for megatron bridge recipes ==="
find . -path "*megatron*bridge*recipe*" -o -path "*megatron*recipe*" | grep -E "(\.py|\.yaml)$" | head -20
echo -e "\n=== Check if there's a separate install for megatron-bridge ==="
ls -la requirements*.txt pyproject.toml setup.py 2>/dev/null | head -10
echo -e "\n=== Search for any import of megatron.bridge ==="
rg "from megatron|import megatron" --type py -B 1 -A 1 | head -40
echo -e "\n=== Check the distillation_math.yaml base config ==="
cat examples/configs/distillation_math.yaml | head -100
Repository: NVIDIA-NeMo/RL
Length of output: 6630
🏁 Script executed:
#!/bin/bash
# Check if there are test files or other distillation configs that might hint at expected models
echo "=== Check other distillation YAML configs to understand model/recipe patterns ==="
find examples/configs/recipes/llm -name "*distillation*.yaml" -type f -exec basename {} \; | sort
echo -e "\n=== Look at the FSDP variant for comparison ==="
head -40 examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml
Repository: NVIDIA-NeMo/RL
Length of output: 1702
🏁 Script executed:
#!/bin/bash
# Check if the policy model_name is specified elsewhere or if it's inferred
echo "=== Search for how policy model_name is determined ==="
rg "model_name|student|policy" nemo_rl/models/ --type py -B 2 -A 2 | grep -A 5 -B 5 "1.7\|1_7\|student" | head -60
echo -e "\n=== Check if megatron_recipe can override model size or just provide base config ==="
rg "megatron_recipe|load_recipe" nemo_rl/models/ --type py -B 3 -A 3 | head -80
Repository: NVIDIA-NeMo/RL
Length of output: 4475
🏁 Script executed:
#!/bin/bash
# Search for any Qwen3 recipe definitions in megatron-bridge or available recipes
echo "=== Search for all available Qwen3 recipes ==="
rg "qwen3.*pretrain_config|def.*qwen3_.*_config" -g "*.py"
echo -e "\n=== Check pyproject.toml for megatron-bridge dependency and version ==="
grep -A 10 -B 2 "megatron" pyproject.toml
echo -e "\n=== Check if there's a mapping of model sizes to recipes ==="
rg "1.7|1_7|1\.7|4b|8b|32b" nemo_rl/models/ --type py | grep -i recipe
Repository: NVIDIA-NeMo/RL
Length of output: 2883
🏁 Script executed:
#!/bin/bash
# Check the git history or comments around this change
echo "=== Check if there are any recent changes or TODOs about this recipe ==="
rg "TODO|FIXME|XXX|qwen3_32b_pretrain_config" nemo_rl/ examples/ --type py --type yaml -B 2 -A 2
echo -e "\n=== Compare the megatron recipe variant with FSDP variant more carefully ==="
echo "=== FSDP (without megatron_recipe) ==="
grep -A 30 "^policy:" examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml | head -40
echo -e "\n=== Megatron (with megatron_recipe) ==="
grep -A 30 "^policy:" examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml | head -40
Repository: NVIDIA-NeMo/RL
Length of output: 24191
Policy uses 32B megatron_recipe but should use 1.7B recipe for the student model.
The policy (student model) is configured with megatron_recipe: megatron.bridge.recipes.qwen.qwen3.qwen3_32b_pretrain_config, which provides the 32B model architecture. However, the policy model name inherited from base config is Qwen/Qwen3-1.7B-Base (1.7B). This mismatch will cause the student to be loaded with the wrong model architecture. Replace with a 1.7B recipe, or if no 1.7B recipe exists, remove the recipe and use the HuggingFace model directly like the FSDP variant does.
🤖 Prompt for AI Agents
In
`@examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.yaml`
at line 25, The policy's megatron_recipe is pointing to the 32B config
(megatron.bridge.recipes.qwen.qwen3.qwen3_32b_pretrain_config) while the policy
model name is Qwen/Qwen3-1.7B-Base; change megatron_recipe to the matching 1.7B
recipe (if one exists) or remove the megatron_recipe entry so the policy uses
the HuggingFace model directly (like the FSDP variant) to ensure the student
model architecture matches Qwen/Qwen3-1.7B-Base.
lr_warmup_init: 5.0e-08
distributed_data_parallel_config:
  overlap_grad_reduce: false
megatron_recipe: megatron.bridge.recipes.qwen.qwen2.qwen2_5_vl_3b_pretrain_config
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if the megatron bridge recipe module exists and inspect its TP default
echo "=== Searching for qwen2_5_vl_3b_pretrain_config ==="
rg -rn 'qwen2_5_vl_3b_pretrain_config' --type py -C5
echo ""
echo "=== Searching for tensor_model_parallel_size in related recipe files ==="
fd 'qwen2' --extension py --exec grep -Hn 'tensor_model_parallel' {} \;
echo ""
echo "=== Check base defaults file for TP setting ==="
fd 'vlm_grpo_3B.yaml' --exec cat {} \;
Repository: NVIDIA-NeMo/RL
Length of output: 7552
🏁 Script executed:
# Also check the actual YAML file being reviewed
echo "=== Content of the reviewed YAML file ==="
cat -n examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml
Repository: NVIDIA-NeMo/RL
Length of output: 589
Add explicit tensor_model_parallel_size: 2 override or correct the filename.
The filename advertises megatrontp2 but the verifiable configuration shows no TP=2 setting. The base defaults file specifies tensor_parallel_size: 1, and the reviewed file contains no override for it. Either add an explicit tensor_model_parallel_size: 2 under the policy section to match the filename, or rename the file to reflect the actual TP configuration (e.g., remove or change the megatrontp segment).
Per coding guidelines, the strategy segment in the VLM recipe naming pattern must accurately reflect the parallelism configuration.
🤖 Prompt for AI Agents
In
`@examples/configs/recipes/vlm/vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n8g-megatrontp2.v1.yaml`
at line 11, The recipe filename advertises TP=2 but the config lacks that
override; update the policy section to explicitly set
tensor_model_parallel_size: 2 (ensuring it overrides the base default
tensor_parallel_size: 1) so the effective parallelism matches the filename, or
alternatively rename the file to remove the "megatrontp2" segment; locate the
megatron_recipe entry and the policy block in the YAML (look for keys named
policy, tensor_model_parallel_size, and tensor_parallel_size) and add the
explicit tensor_model_parallel_size: 2 entry or adjust the filename accordingly.
if not os.path.exists(pretrained_run_config):
    raise FileNotFoundError(
        f"Pretrained run config not found at {pretrained_run_config} on rank={rank}. "
        "This usually means that the one-time HF->mcore conversion on rank=0 saved to a directory "
        "not being mounted on this node. Please check"
    )
Incomplete error message — sentence ends with "Please check".
The FileNotFoundError message is truncated mid-sentence.
📝 Proposed fix
raise FileNotFoundError(
f"Pretrained run config not found at {pretrained_run_config} on rank={rank}. "
"This usually means that the one-time HF->mcore conversion on rank=0 saved to a directory "
- "not being mounted on this node. Please check"
+ "not being mounted on this node. Please check that the shared filesystem is accessible."
)
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 299-303: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@nemo_rl/models/megatron/setup.py` around lines 298 - 303, The
FileNotFoundError message for pretrained_run_config is truncated; update the
exception text (in the block that checks os.path.exists(pretrained_run_config)
where pretrained_run_config and rank are referenced) to complete the sentence —
e.g., explain that the one-time HF->mcore conversion on rank=0 may have written
to a directory not mounted on this node and instruct the user to verify that the
directory is mounted and that rank 0 wrote the config (include the path and rank
variables in the message for clarity).
def _validate_optimizer_config(megatron_cfg: ConfigContainer) -> None:
    """Validate optimizer configuration."""
    optimizer_cpu_offload = config["megatron_cfg"]["optimizer"]["optimizer_cpu_offload"]
    optimizer_offload_fraction = config["megatron_cfg"]["optimizer"][
        "optimizer_offload_fraction"
    ]
    optimizer_cpu_offload = megatron_cfg.optimizer.optimizer_cpu_offload
    optimizer_offload_fraction = megatron_cfg.optimizer.optimizer_offload_fraction
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the file structure and the specific lines mentioned
wc -l nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 95
🏁 Script executed:
# Look at lines 493-496 and surrounding context
sed -n '490,510p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1016
🏁 Script executed:
# Look at lines 640-645 which allegedly set optimizer only when overrides exist
sed -n '635,650p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 861
🏁 Script executed:
# Search for where _validate_optimizer_config is called
rg -n '_validate_optimizer_config' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 181
🏁 Script executed:
# Look at context around line 354 where _validate_optimizer_config is called
sed -n '345,365p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 760
🏁 Script executed:
# Look for where megatron_cfg is initialized before line 354
sed -n '320,360p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1341
🏁 Script executed:
# Search for any defaults or initialization of optimizer in the file
rg -n 'megatron_cfg.optimizer\s*=' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 137
🏁 Script executed:
# Let's check what _validate_dtype_config does with the optimizer parameter
sed -n '470,495p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1142
🏁 Script executed:
# Check for _validate_dtype_config definition
rg -n 'def _validate_dtype_config' nemo_rl/models/megatron/setup.py -A 10
Repository: NVIDIA-NeMo/RL
Length of output: 688
🏁 Script executed:
# Check if optimizer is loaded from base config or recipes
rg -n 'OptimizerConfig\|optimizer.*=' nemo_rl/models/megatron/setup.py | head -20
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Let's see the full _validate_dtype_config function to see how it handles optimizer_cfg
sed -n '559,580p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1193
🏁 Script executed:
# Check if there's a None guard in _validate_dtype_config
rg -n 'if optimizer_cfg' nemo_rl/models/megatron/setup.py -A 2 -B 2
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Check the complete _update_megatron_config function to see all optimizer initialization paths
rg -n 'def _update_megatron_config' nemo_rl/models/megatron/setup.py -A 50 | head -60
Repository: NVIDIA-NeMo/RL
Length of output: 2431
🏁 Script executed:
# Check what recipes are available and if they include optimizer config
find . -name "*.yaml" -path "*example*" -o -name "*.yaml" -path "*recipe*" | grep -i recipe | head -20
Repository: NVIDIA-NeMo/RL
Length of output: 1758
🏁 Script executed:
# Look for recipe definitions
fd -e yaml . --path "*recipe*" | head -20
Repository: NVIDIA-NeMo/RL
Length of output: 291
🏁 Script executed:
# Check example configs to see if optimizer is typically included
fd -e yaml . examples/configs/ 2>/dev/null | head -10
Repository: NVIDIA-NeMo/RL
Length of output: 429
🏁 Script executed:
# Check a few recipes to see if they define optimizer config
cat examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n4g-fsdp2tp1-quick.v2.yaml | grep -A 20 "megatron_cfg:"
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Check another megatron-based recipe
cat examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml | grep -A 30 "megatron_cfg:" | head -40
Repository: NVIDIA-NeMo/RL
Length of output: 312
🏁 Script executed:
# Check a base example config
cat examples/configs/dpo.yaml | grep -A 50 "megatron_cfg:"
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Let's search for optimizer config in recipe files
rg -l "optimizer:" examples/configs/recipes/llm/ | head -5
Repository: NVIDIA-NeMo/RL
Length of output: 429
🏁 Script executed:
# Check if any recipe has optimizer in megatron_cfg
rg "megatron_cfg:" examples/configs/recipes/llm/ -A 30 | grep -B5 -A5 "optimizer" | head -30
Repository: NVIDIA-NeMo/RL
Length of output: 2533
🏁 Script executed:
# Look for a more complete megatron recipe file
wc -l examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml
head -100 examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml
Repository: NVIDIA-NeMo/RL
Length of output: 976
🏁 Script executed:
# Let's check if the recipe loading from megatron.bridge initializes optimizer
# First, let's see what megatron_recipe parameter does
rg -n 'megatron_recipe' nemo_rl/models/megatron/setup.py -B2 -A2
Repository: NVIDIA-NeMo/RL
Length of output: 542
🏁 Script executed:
# Check how megatron_cfg is loaded from the recipe
rg -n 'import_module_from_config\|load.*recipe\|get_megatron_config' nemo_rl/models/megatron/setup.py -B2 -A2
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Search for where megatron_cfg is initially created/loaded before _update_megatron_config
sed -n '280,320p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1883
🏁 Script executed:
# Let's check what load_recipe does and where it comes from
rg -n 'def load_recipe\|from.*load_recipe\|import.*load_recipe' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Check imports at the top of the file
head -50 nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 1672
🏁 Script executed:
# Let's check if there's a test or example that shows what happens when optimizer is not provided
rg -n 'optimizer.*None\|None.*optimizer' nemo_rl/models/megatron/setup.py -B2 -A2 | head -40
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Search for load_recipe import
rg -n 'load_recipe' nemo_rl/models/megatron/setup.py -B5 | head -20
Repository: NVIDIA-NeMo/RL
Length of output: 645
🏁 Script executed:
# Look at the complete imports section
sed -n '1,80p' nemo_rl/models/megatron/setup.py | tail -40
Repository: NVIDIA-NeMo/RL
Length of output: 1595
🏁 Script executed:
# Let's look at what _validate_dtype_config actually does with a None optimizer
sed -n '559,600p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 2247
🏁 Script executed:
# Let's check the order of calls and see all validation functions
sed -n '350,365p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 523
🏁 Script executed:
# Check if _validate_dtype_config can handle None optimizer_cfg
rg -n '_validate_dtype_config\|optimizer_cfg' nemo_rl/models/megatron/setup.py | grep -A2 -B2 "354\|355\|559"
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Look for where optimizer might be None and see if there's any guard
rg -n 'optimizer.*None' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 255
🏁 Script executed:
# Check if all recipes in examples actually include optimizer config
rg "megatron_cfg:" examples/configs/recipes/llm/*.yaml -A 20 | grep -c "optimizer:"
Repository: NVIDIA-NeMo/RL
Length of output: 60
🏁 Script executed:
# Let's see line 662 and surrounding context to understand another place that checks optimizer is not None
sed -n '660,680p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 906
🏁 Script executed:
# Let's check what happens at line 813
sed -n '810,830p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 892
🏁 Script executed:
# Let's trace through - what does load_recipe return? Check the recipe_config module
cat nemo_rl/models/megatron/recipe_config.py | head -50
Repository: NVIDIA-NeMo/RL
Length of output: 1816
🏁 Script executed:
# Let's check what happens when a recipe doesn't have optimizer - is that possible?
# Look at what load_recipe returns and check recipe examples
rg -n "def load_recipe" nemo_rl/models/megatron/recipe_config.py -A 30
Repository: NVIDIA-NeMo/RL
Length of output: 1315
🏁 Script executed:
# Check if recipes without optimizer in megatron_cfg are possible
# Let's look at a simple FSDP recipe that doesn't have megatron_cfg.optimizer
cat examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n4g-fsdp2tp1-quick.v2.yaml
Repository: NVIDIA-NeMo/RL
Length of output: 274
🏁 Script executed:
# Let's check what happens when we have a recipe without optimizer and see if there's actual flow
# Look for tests that might show this scenario
find . -name "*test*.py" -path "*/megatron/*" | head -5
Repository: NVIDIA-NeMo/RL
Length of output: 209
🏁 Script executed:
# Let's check the test file for megatron setup
cat tests/unit/models/megatron/test_megatron_setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 39612
🏁 Script executed:
# Let's also check what the actual parent yaml for the FSDP recipe contains
cat examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp2-quick.v2.yaml | grep -A 30 "megatron_cfg:"
Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
# Check if there are non-megatron recipes that may have None optimizer
rg "megatron_recipe:" examples/configs/ -l
Repository: NVIDIA-NeMo/RL
Length of output: 1888
🏁 Script executed:
# Let's look more closely at the test - specifically how it's calling the function
sed -n '1010,1030p' tests/unit/models/megatron/test_megatron_setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 827
🏁 Script executed:
# Check if there are any recipes without megatron_recipe that use megatron_cfg
rg "megatron_cfg:" examples/configs/ -l | while read file; do
if ! grep -q "megatron_recipe:" "$file"; then
echo "=== $file ==="
grep -A 10 "megatron_cfg:" "$file" | head -15
fi
done
Repository: NVIDIA-NeMo/RL
Length of output: 4802
🏁 Script executed:
# Now let's trace the actual execution path - what happens at line 354?
# Check if there's guaranteed initialization of optimizer before _validate_optimizer_config
sed -n '280,365p' nemo_rl/models/megatron/setup.py | grep -A5 -B5 "optimizer\|_validate"
Repository: NVIDIA-NeMo/RL
Length of output: 920
🏁 Script executed:
# Let's check what happens when a recipe like grpo-deepseek-v3-32n8g doesn't have optimizer
# Look for how recipes set optimizer and whether it defaults to something
sed -n '287,295p' nemo_rl/models/megatron/setup.py
Repository: NVIDIA-NeMo/RL
Length of output: 455
🏁 Script executed:
# Let's trace the actual execution - what does load_recipe return?
# Check if recipes without optimizer defined leave it as None
rg "megatron.bridge.recipes" examples/configs/ | head -3
Repository: NVIDIA-NeMo/RL
Length of output: 452
🏁 Script executed:
# Let's check if there's a default recipe function to understand what load_recipe returns
python3 << 'EOF'
# Simulate what happens when loading a recipe
# The recipe functions are imported from megatron.bridge
# If a recipe function is called and doesn't set optimizer, what happens?
# We need to understand ConfigContainer behavior
import sys
sys.path.insert(0, '/tmp')
# We can't import megatron.bridge in this environment, but we can reason about it
# based on the code structure shown
# From the code, load_recipe calls a recipe function that returns a ConfigContainer
# If the recipe doesn't set optimizer, megatron_cfg.optimizer would be whatever the
# default is in ConfigContainer - likely None
print("When a recipe is loaded:")
print("1. load_recipe() calls the recipe function")
print("2. Recipe function returns a ConfigContainer")
print("3. If recipe doesn't initialize optimizer, it would be None by default")
print("")
print("Then in _update_megatron_config:")
print("1. If no optimizer overrides in config, the None value is NOT modified")
print("2. So megatron_cfg.optimizer remains None")
print("")
print("Then at line 352: _validate_dtype_config accesses optimizer_cfg.use_precision_aware_optimizer")
print("Then at line 354: _validate_optimizer_config accesses optimizer.optimizer_cpu_offload")
print("Both would crash if optimizer is None")
EOF
Repository: NVIDIA-NeMo/RL
Length of output: 607
🏁 Script executed:
# Check if _validate_dtype_config has any None guard
sed -n '559,590p' nemo_rl/models/megatron/setup.py | grep -i "none\|if.*optimizer"
Repository: NVIDIA-NeMo/RL
Length of output: 349
Add guard against None optimizer before accessing its attributes.
When a Megatron-Bridge recipe doesn't define an optimizer configuration, megatron_cfg.optimizer remains None. The code then crashes at lines 352 and 354 when attempting to access optimizer attributes without checking:
- Line 352: _validate_dtype_config() accesses optimizer_cfg.use_precision_aware_optimizer and optimizer_cfg.bf16 / optimizer_cfg.fp16
- Line 354: _validate_optimizer_config() accesses optimizer.optimizer_cpu_offload and optimizer.optimizer_offload_fraction
The issue occurs because _update_megatron_config() (lines 640–645) only modifies optimizer if config overrides are present; if the recipe lacks optimizer config and no overrides are specified, the None value remains unchanged.
🛡️ Proposed fixes
For _validate_optimizer_config:
def _validate_optimizer_config(megatron_cfg: ConfigContainer) -> None:
"""Validate optimizer configuration."""
+ if megatron_cfg.optimizer is None:
+ return
optimizer_cpu_offload = megatron_cfg.optimizer.optimizer_cpu_offload
    optimizer_offload_fraction = megatron_cfg.optimizer.optimizer_offload_fraction
For _validate_dtype_config, add the same guard at the start, or ensure optimizer is never passed as None.
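A minimal sketch of that guard for _validate_dtype_config, assuming it also receives the full megatron_cfg as the optimizer variant above does (the real signature in setup.py may differ):

def _validate_dtype_config(megatron_cfg) -> None:
    """Validate dtype-related settings (sketch)."""
    if megatron_cfg.optimizer is None:
        # The recipe defined no optimizer; skip the optimizer-related dtype checks.
        return
    optimizer_cfg = megatron_cfg.optimizer
    # ... existing checks on optimizer_cfg.use_precision_aware_optimizer and
    # optimizer_cfg.bf16 / optimizer_cfg.fp16 continue unchanged below ...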
🤖 Prompt for AI Agents
In `@nemo_rl/models/megatron/setup.py` around lines 493 - 496, The validators
access megatron_cfg.optimizer without checking for None; add a guard at the
start of _validate_optimizer_config (and similarly in _validate_dtype_config) to
return early if megatron_cfg.optimizer is None (or otherwise bail with a no-op)
so you don't dereference None when reading optimizer_cpu_offload,
optimizer_offload_fraction, use_precision_aware_optimizer, bf16 or fp16;
reference the megatron_cfg.optimizer object in those functions and ensure any
downstream logic expects a present optimizer only after this check.
print("HELLO")
# Dump ConfigContainer to YAML for inspection (only on rank 0)
if self.rank == 0:
    config_dump_path = "/lustre/fsw/portfolios/coreai/users/sfawzy/final_megatron_config_6.yaml"
    try:
        self.megatron_cfg.to_yaml(config_dump_path)
        print(f"[DEBUG] Saved final ConfigContainer to: {config_dump_path}")
    except Exception as e:
        print(f"[WARNING] Failed to save ConfigContainer to YAML: {e}")
    # Exit early after dumping config for inspection
    import sys
    print("[DEBUG] Exiting after ConfigContainer dump")
    sys.exit(0)
Remove debug code: sys.exit(0) terminates the worker during __init__, deadlocking all other ranks.
This block:
- Prints "HELLO" unconditionally on all ranks.
- On rank 0, writes to a hardcoded user-specific Lustre path, then calls sys.exit(0) — killing the process mid-initialization.
- All non-zero ranks will hang indefinitely at the next collective operation, causing a distributed deadlock.
- The late import sys inside the conditional is also non-idiomatic.
This is development/debug code that must be removed before merging. As per coding guidelines, debug artifacts should be removed before merging.
🗑️ Proposed fix: remove debug block
- print("HELLO")
- # Dump ConfigContainer to YAML for inspection (only on rank 0)
- if self.rank == 0:
- config_dump_path = "/lustre/fsw/portfolios/coreai/users/sfawzy/final_megatron_config_6.yaml"
- try:
- self.megatron_cfg.to_yaml(config_dump_path)
- print(f"[DEBUG] Saved final ConfigContainer to: {config_dump_path}")
- except Exception as e:
- print(f"[WARNING] Failed to save ConfigContainer to YAML: {e}")
- # Exit early after dumping config for inspection
- import sys
- print("[DEBUG] Exiting after ConfigContainer dump")
- sys.exit(0)
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 288-288: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` around lines 281 -
293, Remove the debug block that prints "HELLO", writes a hardcoded YAML path,
and calls sys.exit during initialization: delete the unconditional
print("HELLO") and the entire rank-0 conditional that calls
self.megatron_cfg.to_yaml(...), imports sys, prints debug/warning messages, and
invokes sys.exit(0). Ensure no hardcoded paths remain and do not perform any
early process termination in __init__; if you need to persist the config,
replace this with a safe, optional debug hook (e.g., gated by a proper config
flag) that runs non-destructively and is synchronized across ranks rather than
exiting the process.
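If persisting the resolved config is still useful, one possible shape for such a gated hook is sketched below (the dump_final_megatron_config_path key, the self.cfg attribute, and the barrier placement are assumptions for illustration, not something this PR defines; intended as a method on the worker class):

import os

import torch


def _maybe_dump_megatron_cfg(self) -> None:
    """Optionally persist the resolved ConfigContainer without disturbing training."""
    dump_path = self.cfg.get("dump_final_megatron_config_path")  # hypothetical config key
    if not dump_path:
        return
    if self.rank == 0:
        os.makedirs(os.path.dirname(dump_path) or ".", exist_ok=True)
        self.megatron_cfg.to_yaml(dump_path)  # same serialization call the debug block used
        print(f"Saved final ConfigContainer to: {dump_path}")
    # Keep all ranks in step instead of exiting mid-initialization.
    if torch.distributed.is_initialized():
        torch.distributed.barrier()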
What does this PR do?
Use Megatron-Bridge recipes for megatron_cfg.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Chores